Abstract: We introduce collaborative tagging and faceted
search on structured P2P systems. Since a trivial, brute-force
mapping of an entire folksonomy onto a DHT-based system
may reduce scalability, we propose an approximated graph
maintenance approach. Evaluations on real data from
Last.fm show that such strategies reduce vocabulary noise (i.e.,
overfitting of the representation) and hotspot issues.

I. Introduction 

Social applications are rapidly popularizing collaborative 
tools for indexing, retrieval, access and distribution of content 
over the Internet. Multimedia resources are made available 
through websites and p2p systems, together with annotations,
metadata, tags, and other kinds of information about the owner
and/or the content itself. Such information is often used to fill
the semantic gap between the personal user experience and
a more general description of a given resource. Nevertheless,
such a huge volume of information is often hidden from traditional
search engines, since a common query infrastructure and
language is missing.

Over the years, the Web community has been supported
by many retrieval techniques, which can be categorized into
two main paradigms: navigational search and direct search.
The first family of strategies assumes the existence of a
taxonomy, usually predefined by a group of experts, that can be
iteratively browsed by a user from general categories to more
specific subclasses of information (e.g., Yahoo! Directory).
Direct search lets the user query the engine by means of a (set
of) keyword(s) (e.g., Google). Even if the latter has been
very successful in recent years, the
navigational paradigm has recently re-emerged due to the diffusion
of folksonomies within popular tagging systems (e.g., Flickr,
del.icio.us, and so on); in fact, folksonomies have been shown
to outperform monolithic hierarchical classifications in social
domains where many users with different mental attitudes and
vocabularies are active. Quite surprisingly, in the p2p domain
the benefits of navigational search have been underestimated, and
few proposals exist in the related literature. In particular, a
lot of effort has been devoted to direct search strategies, very
common in unstructured p2p systems, and to exact-match key-based
lookup techniques, which are used by almost
every structured overlay network. Even if some scholars have
proposed semantic routing for p2p systems (moving from
the pioneering work of Crespo and Garcia-Molina [1]), little
research has been conducted on merging collaborative tagging,
folksonomies and p2p systems.



First of all, we need to adopt a general tagging system model
that can be exploited to define navigational search strategies
(Section III). Such a model should fit the social media domain
in the broadest sense of the word, since it could be used to
implement a high-level engine that allows the user to search in
different environments (web, social networks, p2p file sharing
networks, and so on).

A structured p2p system is the natural setting for implementing
such a model, because of its better scalability and inherent
distribution of keys and indexes of resources. Nevertheless, it
is hard to find a one-to-one mapping between a given folksonomy
(seen as a network of tags) and a DHT system (that partitions a
given keyspace among the participating nodes). In Section IV
we propose a way to perform such a mapping, introducing
an approximation strategy that fits well with dynamic and
decentralized tagging.

Finally, a very relevant issue may arise if unbalanced
distributions of popular tags are used. As we show in Section
V, our proposal can be used efficiently in a real domain (i.e.,
Last.fm), thanks to the approximation strategies cited above.

II. Related works 

Several other efforts to study the possible deployment of
collaborative tagging systems on peer-to-peer environments
have recently been made.

In [2] an efficient indexing scheme for storing and retrieving
concept hierarchies over a fully decentralized system is given,
even if folksonomies are not taken into account.

A p2p infrastructure for tagging systems (PINTS) is proposed
in [3]; in particular, the authors design a scheme to
maintain feature vectors for the characterization of users and
resources of a tagging environment on a DHT. Feature vectors
may be useful for calculating the similarity between users or
for constructing algorithms for ranked retrieval.

PINTS comes as a building block of Tagster [4], a distributed
content sharing and tagging system where the
user-resource-tag graph is stored in a DHT. A dedicated storing
index is used for each tagging relation, so each edge in the
graph is stored at different overlay responsibility areas. For this
reason, one lookup is needed for each edge retrieval, and this
could make the navigation expensive in systems with a huge
number of tags and objects. Furthermore, navigational aspects
between related tags are not explicitly taken into account.

In T-DHT, a hybrid structured-unstructured p2p approach
is described. The scheme does not explicitly model a
folksonomy with inter-tag relations and with the possibility to
navigate through related labels.

Fig. 1. Bidirectional arcs in a Folksonomy Graph (right) aggregate
asymmetric weights in the Tag-Resource graph (left)

Tag-based navigation is taken into account in [6], where
a centralized web service discovery system based on folk- 
sonomies is presented. Tags, together with variables, are used 
to assign semantic information to input and output messages 
of the service operations. The key feature of this work is 
the possibility to exploit the subsumption relation between 
variable types to compose the discovery activity as an acyclic 
navigational workflow. 

Even if the work described in [7] is not related to p2p, it
is worth citing here because of the link the authors draw
between taxonomies and folksonomies. It has inspired our
further insights on query convergence.

III. Tagging system model 

The first step in the definition of our distributed search
engine is a high-level description of the tagging system. We
define the tag-resource graph and a folksonomy graph
by means of a simple tag similarity measure (Section III-A),
and describe how these graphs are modified during user interaction
(Section III-B); finally, we show how this model can be used
for navigating through tags in search of resources (Section III-C).

A. Graphs definition 

Usually, collaborative tagging systems are defined as tripartite
hypergraphs [8], [9], in which three sets of actors are
involved:

• U is the set of users of the system, that actually tag 
resources. 

• T is the set of tags. 

• R is the set of resources being tagged. 

However, since in our work we focus mainly on tags 
and resources, we perform an aggregation across the user 
dimension in order to obtain a bipartite graph that links tags to 
resources. We define such a graph as the Tag-Resource Graph 
(TRG), where $TRG = (T \cup R, E_{TR})$, s.t. $(t, r) \in E_{TR}$ iff at
least one user tagged $r$ with $t$. Moreover, for each arc in $E_{TR}$ we
define a weight $u(t, r)$ that is the number of times $r$ has been
tagged with $t$ (see Figure 1, left). The reader can observe
that we are adopting the so-called distributional aggregation 
approach [ 1 1 that yields to a graph in which the weight of 
an edge $(t, r)$, $t \in T$, $r \in R$, is equal to the number of users
tagging $r$ with $t$.



Fig. 2. (a) Resource insertion: resource rz, labeled with ti,i2,i3 is inserted 
(b) Tag insertion: tag t-j, is attached to ri- Light arcs and nodes are those 
added during the insertion 



We can use such graph to extract Tags(r) and Res(t) 
denoting, respectively, the subset of tags that label a resource 
r and the subset of resources that have been tagged with t: 

$Tags(r) = \{t \in T \mid \exists (t, r) \in E_{TR}\}, \quad r \in R$ (1)
$Res(t) = \{r \in R \mid \exists (t, r) \in E_{TR}\}, \quad t \in T$ (2)

Since our purpose is to define a tag-based search engine, 
we introduce a simple Folksonomy Graph (FG) that can 
be trivially derived through collaborative tagging. Inter-tag
correlations should be detected by means of a distance measure
between any pair of tags. We interpret such a distance as an
asymmetric similarity function between two generic tags $t_1$
and $t_2$, s.t.

$sim(t_1, t_2) = \sum_{r \in Res(t_1)} u(t_2, r)$.

Roughly, $sim(t_1, t_2)$ says how many times resources labeled
with $t_1$ have also been tagged with $t_2$. Even if many different
similarity measures could be adopted in folksonomies [10],
such an aggregation of tag-resource weights is a metric that is
easy to calculate and, as we explain in Section IV, easy
to map onto a fully decentralized and dynamic context.
This metric can be considered a generalization of tag-tag
co-occurrence [11].

Now, we can define our folksonomy graph as $FG =
(T, E_F)$, s.t. $(t_1, t_2) \in E_F$ iff $sim(t_1, t_2) \geq 1$. Let us observe
that, by construction, if $sim(t_1, t_2) \neq 0$, then $sim(t_2, t_1) \neq 0$,
even if it may happen that $sim(t_1, t_2) \neq sim(t_2, t_1)$. Hence,
we represent connections between tags using bidirectional arcs
with two weights. Finally, we will need to deal with the tags
related to a given tag $t$. This set is the neighborhood of $t$ in
FG, denoted by $N_{FG}(t)$.

For example, in Figure 1 (right), the arc $(t_1, t_2)$ of FG has
weight 5 because resources $r_1$ and $r_2$ have also been tagged
with $t_2$ by, respectively, 3 and 2 users; let us observe that,
conversely, $sim(t_2, t_1) = 7$.
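The similarity definition can be checked on a small example. The sketch below is only illustrative: the weights of $t_2$ (3 and 2) come from the example above, while the weights of $t_1$ (4 and 3) are assumed values, chosen so that the two similarities match the quoted 5 and 7. The function names are hypothetical.

```python
# Hypothetical TRG weights u(t, r). The u(t2, .) values are the ones quoted
# in the text; the u(t1, .) values are assumed so that sim(t2, t1) = 7.
u = {
    ("t1", "r1"): 4, ("t1", "r2"): 3,
    ("t2", "r1"): 3, ("t2", "r2"): 2,
}

def res(u, t):
    """Res(t): resources tagged with t at least once."""
    return {r for (tag, r) in u if tag == t}

def sim(u, t1, t2):
    """sim(t1, t2): sum of u(t2, r) over all r in Res(t1)."""
    return sum(u.get((t2, r), 0) for r in res(u, t1))

print(sim(u, "t1", "t2"))  # 5
print(sim(u, "t2", "t1"))  # 7
```

The asymmetry comes only from which tag's weights are summed, which is why both directions are non-zero whenever one of them is.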

B. Graphs maintenance 

The active collaborative behavior of the user community 
leads to a continuous evolution of the TRG and the FG, due 
to the addition of new items and new annotations. 



1) Resource insertion: When a user inserts a new item $r$
and tags it with $T_r = \{t_1, \ldots, t_m\}$, a new resource vertex
is inserted in the TRG. A new vertex is also inserted in the TRG
for each new tag $t_i$, so that $R$ is updated to
$R \cup \{r\}$, and the set of tags to $T \cup T_r$. Moreover, for each
$t_i \in T_r$, an edge $(t_i, r)$ is added to $E_{TR}$, with $u(t_i, r) = 1$.
As a consequence, FG must be changed, too: for each pair of
tags $t_i, t_j \in T_r$, a new arc $(t_i, t_j)$, if not already present, is
added to $E_F$ with $sim(t_i, t_j) = sim(t_j, t_i) = 1$. Otherwise,
$sim(t_i, t_j)$ and $sim(t_j, t_i)$ are simply incremented by one.

2) Tag insertion: Graphs also grow when an existing resource
$r$ is tagged with $t$. First, if $t \notin T$, a proper tag node
is added. Then, if $t \notin Tags(r)$, a new edge $(t, r)$
with $u(t, r) = 1$ is added to $E_{TR}$; conversely, if $t \in Tags(r)$,
then $u(t, r)$ is simply incremented. Similarities are changed
accordingly. For each tag $\tau \in Tags(r)$, $sim(\tau, t)$ is incremented
by one. Instead, $sim(t, \tau)$ is changed depending on
whether $t$ was in $Tags(r)$ before the tagging operation or
not. If $t$ was in $Tags(r)$, then $sim(t, \tau)$ is left unchanged;
otherwise $sim(t, \tau)$ is incremented by $u(\tau, r)$. Arcs in the
FG are created or updated accordingly.

Examples of both operations are shown in Figure 2.
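The two maintenance rules can be sketched as a small in-memory structure. This is a minimal illustration and not the authors' implementation: the class and method names are hypothetical, the graphs are plain dictionaries, and `brute_sim` recomputes the similarity from its definition so that the incremental updates can be checked against it.

```python
from collections import defaultdict

class Folksonomy:
    """In-memory TRG and FG, updated incrementally (illustrative only)."""

    def __init__(self):
        self.u = defaultdict(int)      # u[(t, r)]: TRG edge weights
        self.sim = defaultdict(int)    # sim[(t1, t2)]: FG arc weights
        self.tags = defaultdict(set)   # tags[r] = Tags(r)

    def insert_resource(self, r, new_tags):
        """Insert a new resource r labeled with the distinct tags new_tags."""
        for t in new_tags:
            self.u[(t, r)] = 1
            self.tags[r].add(t)
        # Every ordered pair of distinct tags gains one unit of similarity.
        for a in new_tags:
            for b in new_tags:
                if a != b:
                    self.sim[(a, b)] += 1

    def tag(self, r, t):
        """Tag the existing resource r with t."""
        was_present = t in self.tags[r]
        others = self.tags[r] - {t}
        self.u[(t, r)] += 1
        self.tags[r].add(t)
        for tau in others:
            self.sim[(tau, t)] += 1                      # always one unit
            if not was_present:
                self.sim[(t, tau)] += self.u[(tau, r)]   # r joins Res(t)

def brute_sim(f, t1, t2):
    """sim(t1, t2) recomputed from the definition, for checking."""
    return sum(w for (t, r), w in f.u.items() if t == t2 and t1 in f.tags[r])
```

After any sequence of `insert_resource` and `tag` calls, `f.sim[(a, b)]` should agree with `brute_sim(f, a, b)` for every pair of distinct tags.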

C. Faceted Search within the Folksonomy Graph 

Our purpose is to exploit our model in order to let the 
user explore the given multi-dimensional information space 
by iteratively narrowing the number of choices at each search 
step. 

Many popular tagging systems (e.g., Flickr, Last.fm, and 
so on) make use of resource clustering according to some 
measure of similarity. For example, considering our Folk- 
sonomy Graph, we can easily identify clusters representing 
repeated patterns of tags that can be presented to the users 
through lists or tag clouds. Such clusters can be intuitively 
used to refine the query or to disambiguate search keywords. 
Nevertheless, clustering techniques can generate unpredictable 
groups, produce cycles in the navigation process, and limit 
the browsing features of the system since they may not allow 
refinements. 

Generally speaking, users prefer hierarchical classifications
with clear and meaningful labels at each level of the tree.
For example, a tag that is presented more than once during
the same search process can generate confusion, as can
a general term (e.g., "rock") that is found in a cluster after
a specific tag (e.g., "heavy-metal") has been selected.
Unfortunately, a traditional and rigorous taxonomy is difficult
to provide in a highly dynamic social domain with many
users with different mental attitudes and vocabularies.

Faceted search can be seen as a middle-ground approach
that allows the user to "dive" into the folksonomy without
semantic cycles and to iteratively refine the tag-resource space
(e.g., TagExplorer by Yahoo! Research). According to this
approach, the user browses the tagging system through a path
in FG. We can interpret every tag of such a path as a different
level of a hierarchical faceted search process; in fact, selecting
subsequent tags in the hierarchy results in a conjunction over
the selected annotations, and each step zooms in on the
tag-resource space, narrowing the focus of the search.

Two important consequences of this approach are query
convergence and vocabulary specialization. Let us assume
that the user starts the search process by selecting tag $t_0$, and
afterwards she chooses $t_1, t_2, \ldots, t_n$. At each step, only
correlated tags are presented to the user, i.e., $t_i$ is
always a neighbor of $t_{i-1}$ in FG, that is, $t_i \in N_{FG}(t_{i-1})$.
Moreover, at step $i$ a set of tags $T_i$ and a set of resources $R_i$
can be presented to the user:

$T_i = \begin{cases} N_{FG}(t_0) & i = 0 \\ T_{i-1} \cap N_{FG}(t_i) & i > 0 \end{cases}
\qquad
R_i = \begin{cases} Res(t_0) & i = 0 \\ R_{i-1} \cap Res(t_i) & i > 0 \end{cases}$

Even if we are not concerned with presentation aspects, we
can assume that, to improve usability, only a subset of $T_i$ is
displayed (using a tag cloud or an alternative representation),
and that a subsequent tag selection would be equivalent to
a zoom-in, possibly visualizing other (more specific) tags.
Obviously, since previously chosen tags are not taken into
account in subsequent steps, $\forall i : |T_i| < |T_{i-1}|$. The upper
bound of the iterative process is $O(|T_0|)$, and so convergence
is trivially proved. It can be noted that browsing can be delayed
by tags that are "semantically equivalent" (i.e., all $\tau \in T_{i-1}$
s.t. $|T_i| = |T_{i-1}|$). However, such situations are limited in
number and do not significantly affect search performance.
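The recursive definition of $T_i$ and $R_i$ amounts to repeated set intersection along the chosen path. The following sketch assumes that $N_{FG}(t)$ and $Res(t)$ are available as plain Python sets; the function name and the toy data are hypothetical.

```python
def faceted_search(neighbors, res, path):
    """Return (T_i, R_i) after following path = [t_0, ..., t_i].

    neighbors[t] is N_FG(t) and res[t] is Res(t), both given as sets.
    """
    tags = set(neighbors[path[0]])       # T_0 = N_FG(t_0)
    resources = set(res[path[0]])        # R_0 = Res(t_0)
    for t in path[1:]:
        tags &= neighbors[t]             # T_i = T_{i-1} ∩ N_FG(t_i)
        resources &= res[t]              # R_i = R_{i-1} ∩ Res(t_i)
    return tags, resources

# Toy folksonomy (assumed data, for illustration only).
neighbors = {
    "rock": {"metal", "pop", "live"},
    "metal": {"rock", "live"},
    "live": {"rock", "metal"},
    "pop": {"rock"},
}
res = {
    "rock": {"r1", "r2", "r3"},
    "metal": {"r2", "r3"},
    "live": {"r3"},
    "pop": {"r1"},
}
```

On this data, following the path rock, metal leaves a single candidate tag ("live") and two resources, so each selection narrows both sets.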

IV. Mapping on a DHT 

Next, we give a general overview of how the model defined
in Section III can be mapped on a Distributed Hash Table. We
show that a naive implementation of our model would lead to
serious inefficiencies which severely limit the scalability of
the system; therefore, we propose an approximated approach
to overcome these issues.

A. Distributed model 

In order to map the folksonomy on a DHT we need to split
both the TRG and the FG into small structural blocks that can
be stored at different overlay nodes. In particular, each block
contains a node together with its outgoing edges. Accordingly,
every resource node $r \in TRG$, together with its outgoing
edges to nodes $t \in Tags(r)$, is contained in a single block.
Symmetrically, every tag $t \in TRG$ with its outgoing edges
to $r \in Res(t)$ forms a block. Likewise, the FG is partitioned
into blocks containing a tag $t$ with the arcs that link it to its
neighbors in $N_{FG}(t)$. More formally, we define four types of
blocks:

1) $r^{(1)} : \{(t, u(t, r)) \mid t \in Tags(r)\}, r \in R$

2) $t^{(2)} : \{(r, u(t, r)) \mid r \in Res(t)\}, t \in T$

3) $t^{(3)} : \{(t', sim(t, t')) \mid t' \in N_{FG}(t)\}, t \in T$

4) $r^{(4)} : (r, URI(r)), r \in R$

The TRG is split into blocks of type 1 and 2, the FG
into blocks of type 3. Type 4 blocks are introduced only to
conceptually associate the resource itself (a URI of a generic
object or service) with its name $r$ (a human-readable identifier
which denotes the resource). Each block is mapped to a
lookup key computed from the name of its node concatenated
with a string which determines the block type (e.g., the hash
of $t \,\|\, \texttt{"2"}$ is the key of the type 2 block for tag $t$). For brevity, we
denote with $r^{(1)}, t^{(2)}, t^{(3)}, r^{(4)}$ the lookup keys for blocks of types 1-4;
for simplicity, we use this notation to directly denote the
blocks without introducing any ambiguity.

Fig. 3. Folksonomy graph mapping on the DHT
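The key derivation described above can be sketched in a few lines. The text does not name the hash function; the sketch assumes SHA-1, which yields 160-bit keys as commonly used in Kademlia-style overlays, and the function name `block_key` is hypothetical.

```python
import hashlib

def block_key(name: str, block_type: int) -> bytes:
    """Lookup key: hash of the node name concatenated with the block type.

    SHA-1 is an assumption made for this sketch; the paper only says that
    the key is a hash of the concatenated string.
    """
    if block_type not in (1, 2, 3, 4):
        raise ValueError("block type must be 1, 2, 3 or 4")
    return hashlib.sha1(f"{name}{block_type}".encode("utf-8")).digest()

# The type 2 and type 3 blocks of the same tag land on different keys,
# hence (in general) on different overlay nodes.
key_trg = block_key("rock", 2)
key_fg = block_key("rock", 3)
```

Plain concatenation can collide when names end in a digit (for instance, name "t1" with type 2 and name "t12" without a suffix); a real deployment would likely add a separator, which is a design choice outside what the text specifies.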

Figures 3 and 4 show how the FG and the TRG depicted
in Figure 2(b) are partitioned into blocks and mapped on a
generic DHT layer. Overlay nodes are labeled with the key
they are responsible for; the content of the overlay nodes' storage
is depicted in the balloons. Type 4 blocks are omitted for
simplicity.

Given such a mapping, navigation, tagging and resource
insertion through the p2p network are easy to describe. At each
navigation step, when a tag $t$ is selected, the resources and tags
related to $t$ are retrieved by fetching the type 2 and type 3 blocks
of $t$; intersections with the tag and resource sets retrieved in the
following steps are performed locally. Insertion of a resource $r$,
marked with tags $t_i$, $i \in 1..m$, requires the creation of the type 4
block of $r$ to store the URI and of the type 1 block of $r$ to connect
the resource with its tags. Reverse tag-resource connections are
mapped by inserting the type 2 block of each tag $t_i$ given in input.
Each $t_i$ should then be connected to the others in the FG by creating
(or updating) its type 3 block. Finally, when a resource $r$ is tagged
with a label $t$, the weight of edge $(t, r)$ is incremented by updating
the type 1 block of $r$ and the type 2 block of $t$. Then, the tags
$\tau \in Tags(r)$ are retrieved from the type 1 block of $r$. For every
$\tau$, the weight of arc $(t, \tau)$ is incremented by updating the type 3
block of $t$, while the reverse connections $(\tau, t)$ must be updated by
modifying the type 3 block of each $\tau \in Tags(r)$.

We suppose that retrieving or modifying the content of a
block on the DHT costs only one overlay lookup operation.
This assumption is reasonable if the overlay is equipped with
proper PUT and GET operations, which, respectively, insert
and retrieve contents from the DHT by exploiting the overlay
network's lookup service. In particular, we suppose that a
block's structure is modified only by the addition (or, possibly,
deletion) of one-bit tokens, each of which determines a unit increment
of a particular arc in the TRG (or FG). This approach leads to
a simpler implementation, suitable for any DHT; however,
we omit implementation details due to space limitations. We
implemented such primitives on Likir [12], based on Kademlia
[13]. Given these assumptions, we can easily calculate the cost
of each basic operation on the folksonomy; results are listed
in the first row of Table I.

Fig. 4. Tag-Resource graph mapping on the DHT
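Under the one-lookup-per-block assumption, the naive costs in Table I can be tallied block by block. The grouping of terms below is one plausible reading of the operations described in this section, not a derivation given by the authors; in particular, it counts the retrieval of $Tags(r)$ as a lookup separate from the update of the same block.

```latex
\begin{align*}
\text{Insert}(r, t_{1..m}) &:\;
  \underbrace{1}_{\text{type 4 of } r}
  + \underbrace{1}_{\text{type 1 of } r}
  + \underbrace{m}_{\text{type 2 of each } t_i}
  + \underbrace{m}_{\text{type 3 of each } t_i}
  = 2 + 2m \\[4pt]
\text{Tag}(r, t) &:\;
  \underbrace{1}_{\text{update type 1 of } r}
  + \underbrace{1}_{\text{update type 2 of } t}
  + \underbrace{1}_{\text{read } Tags(r)}
  + \underbrace{1}_{\text{update type 3 of } t}
  + \underbrace{|Tags(r)|}_{\text{type 3 of each } \tau}
  = 4 + |Tags(r)| \\[4pt]
\text{Search step} &:\;
  \underbrace{1}_{\text{type 2 of } t}
  + \underbrace{1}_{\text{type 3 of } t}
  = 2
\end{align*}
```

Only the tagging operation has a term that grows with the popularity of the resource.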

B. Approximated approach 

Implementing the algorithms defined in Section III-B with
our distributed framework produces two severe issues, both
concerning the tagging operation (i.e., the FG update).

The first is a complexity problem. We stated that when a
new tag $t$ is added to a resource $r$, the weights of the arcs
$(\tau, t), \tau \in Tags(r)$ must be updated. In the DHT domain this
implies the update of the type 3 block of each $\tau \in Tags(r)$.
Accordingly, a number of lookups that is linear in $|Tags(r)|$ is
performed. This cost is unsustainable because, as we show
later in Section V, a resource can be tagged with several
hundred labels. Even if different lookups can be executed in
parallel, the bandwidth usage would be excessive
for such simple and frequent operations.

The second is a consistency problem, caused by a race
condition. To keep the graph consistent with our model, if
the arc $(t, \tau), \tau \in Tags(r)$ was not present before the insertion
of the new tag $t$, then $sim(t, \tau)$ should be incremented by $u(\tau, r)$.
Nevertheless, it is hard to implement this practice correctly in
a fully decentralized system. Indeed, if two users simultaneously
try to add the same tag $t$ to the
same resource $r$, there is the risk that the value of $sim(t, \tau)$,
for any $\tau \in Tags(r)$, is incorrectly incremented twice, for a
total value of $2 \cdot u(\tau, r)$.

These considerations must be taken into account to improve 
the algorithm design. We adopted two approximated strategies 
to solve these problems; call t the new tag and r the resource 
to be labeled. 

Approximation A. Instead of incrementing the weights of all
the arcs $(\tau, t), \tau \in Tags(r)$, perform the increment only for a
random subset of $Tags(r)$. The cardinality of such a subset can
be chosen to be at most a constant $k$; this expedient
reduces the number of lookups needed for a tagging operation,
preventing its complexity from scaling with $|Tags(r)|$. We refer to
$k$ as the connection parameter of the approximated graph. □

Approximation B. If the arc $(t, \tau), \tau \in Tags(r)$ was not
present before the tagging operation, then increment the weight
of $(t, \tau)$ only by one (and not by $u(\tau, r)$). This avoids the
possible inconsistencies due to the simultaneous addition of a new
tag $t$ to resource $r$. □

TABLE I
Distributed tagging system primitives cost

                     Insert(r, t_1..m)   Tag(r, t)        Search step
#lookups (naive)     2 + 2m              4 + |Tags(r)|    2
#lookups (approx.)   2 + 2m              4 + k            2
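The two approximations can be sketched as a single tagging routine over in-memory dictionaries. This is one possible reading of Approximations A and B, not the DHARMA code: the function name, the dictionary layout and the use of Python's `random` module are assumptions, and DHT lookups are not modeled.

```python
import random

def approx_tag(u, sim, tags, r, t, k, rng=random):
    """Tag resource r with t, updating the FG under Approximations A and B.

    u[(t, r)] are TRG weights, sim[(a, b)] FG arc weights, tags[r] = Tags(r).
    Returns the number of reverse arcs (tau, t) that were updated.
    """
    others = sorted(tags.get(r, set()) - {t})
    u[(t, r)] = u.get((t, r), 0) + 1        # the TRG is updated exactly
    tags.setdefault(r, set()).add(t)

    # Approximation B: a missing arc (t, tau) is created with weight 1
    # instead of u(tau, r); arcs that already exist are left untouched.
    for tau in others:
        if (t, tau) not in sim:
            sim[(t, tau)] = 1

    # Approximation A: only a random subset of at most k reverse arcs
    # (tau, t) is incremented, so the cost does not grow with |Tags(r)|.
    chosen = rng.sample(others, min(k, len(others)))
    for tau in chosen:
        sim[(tau, t)] = sim.get((tau, t), 0) + 1
    return len(chosen)
```

In a DHT deployment, each element of `chosen` would correspond to one update of a type 3 block, which is where the `4 + k` entry of Table I comes from.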

Approximations make the similarity graph evolve differently
from the abstract model described in Section III. Thus, the
distance between the theoretical and the mapped graph should be
measured to check how much the search procedure is affected
by our approximations. In Section V we present experimental
results to measure such distance. It is worth noting that only
the FG is affected by the approximation, while the TRG
remains the same. The complexity of approximated operations in
terms of overlay lookups is shown in the second row of Table
I.

The source code of the distributed tagging application
that implements the approximated approach is available online
(see footnote 1) under the name DHARMA (DHT-based Approach for
Resource Mapping through Approximation). An implementation
of the underlying DHT is available as well.

V. Evaluation 

We evaluate how the approximations introduced
in our system design impact the validity of the model,
using analytic and simulation-based approaches. The analysis is
based on a dataset extracted from Last.fm. First (Section V-A),
we give a brief description of the main features of the dataset;
then (Section V-B) we analyze how well the FG created through
the protocol defined in Section IV-B approximates
the theoretical similarity model of the dataset, and we show
that the user search experience does not degrade due to the introduced
approximations. Additionally, in Section V-C we report the
results of a simulation experiment aimed at estimating
the mean number of steps needed for query convergence.

We base our experimental analysis on a snapshot of the
Last.fm web site collected from January 2009 to April 2009 (see footnote 2).
We explored a population of 99405 active users, extracting
nearly 11 million annotations in the form of triples
(user, item, tag), where an item can be an artist, an album or
a specific song. From this raw dataset, we built the bipartite
TRG, which has 1413657 resource nodes and 285182 different
tags, from which we derived the FG.

A. Last.fm dataset overview 

We analyzed some structural properties of both the TRG and the FG.
Among the most relevant are the nodal
degree distributions: in particular, we extracted the distribution
of the cardinalities of the $Tags(r)$, $Res(t)$ and $N_{FG}(t)$ sets.
Statistics of degrees (mean, standard deviation and max values,
all rounded to integer) are shown in Table II, and cumulative
degree distributions are depicted in Figure 5.

1 http://likir.di.unito.it

2 The dataset has been collected in collaboration with the School of
Informatics and Computing, Indiana University at Bloomington, IN, USA.

Fig. 5. Last.fm nodal degree CDF

TABLE II
Last.fm graph degree statistics

Degree   Tags(r)   Res(t)    N_FG(t)
mean     5         26        316
σ        13        525       1569
max      1182      109717    120568

A strong core-periphery structure emerges in the TRG. In
particular, a large portion of tags (about 55%) mark only 1
resource and almost 40% of resources are labeled with
just 1 tag. Conversely, the dataset has a core of much more
connected tags and resources; these correspond to the semantic
top-level (or at least high-level) tags (e.g., "rock", "pop", "seen
live") and to the most popular resources.

A similar scenario occurs in the FG: 80% of tags have
a non-null similarity with at most one or two hundred nodes,
while the nodes belonging to the core of most popular tags
are connected with several thousand nodes.

Given this setting, it is clear that the update and lookup
operations performed within the core structures of the network
are the most "problematic" in terms of DHT operations. First,
the number of tags marking a resource can be too high to
avoid Approximation A. Second, given a very popular tag, the
number of related tags and resources can be huge;
since overlay messages are usually sent in UDP packets,
the limited payload forces the system to send only a subset of the tags
and resources available during a search step. Therefore it is
important that only the most relevant objects are returned;
however, the definition of the DHT GET primitive can easily
be adapted with proper index-side filtering options in order to
meet this requirement.

B. Approximated graph simulation 

Given the Last.fm TRG and FG, we simulate the evolution of
such graphs with our approximated protocol in order to draw
a comparison between the real dataset and the approximated
one.

Fig. 6. Comparison between original and simulated FGs' nodal degree

Fig. 8. Comparison between original and simulated FGs' arcs weights



The simulation starts with a fully disconnected graph that
includes all tags and resources from the Last.fm dataset.
At each step, a resource $r$ and a tag $t$ are selected and a
tagging operation is performed. The FG is updated according
to Approximations A and B. Resource $r$ is chosen with a
probability proportional to its popularity in the dataset (i.e.,
$|Tags(r)|$ in the real TRG); tag $t$ is selected among all tags
in $Tags(r)$ on a local popularity basis (i.e., with probability
proportional to $u(t, r)$). The simulation ends when resources are
labeled with all the related tag instances that appear in the
real dataset. We executed the simulation for different values
of the connection parameter $k$.
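The selection rule for a single simulation step can be sketched with two weighted draws. The sketch only illustrates how one (resource, tag) pair is picked; it does not model the stopping condition (exhausting the tag instances of the real dataset), and the function name and data layout are hypothetical.

```python
import random

def pick_tagging(u, rng=random):
    """Draw one (resource, tag) pair for a simulated tagging operation.

    u maps (tag, resource) to the number of annotations in the real dataset.
    """
    tags_of = {}
    for (t, r), w in u.items():
        tags_of.setdefault(r, {})[t] = w
    resources = sorted(tags_of)
    # Resource popularity: |Tags(r)| in the real TRG.
    r = rng.choices(resources, weights=[len(tags_of[x]) for x in resources])[0]
    # Local tag popularity: u(t, r).
    local = sorted(tags_of[r])
    t = rng.choices(local, weights=[tags_of[r][x] for x in local])[0]
    return r, t
```

A full simulation would also decrement the remaining instances of the drawn pair, so that each annotation of the dataset is replayed exactly once.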

We compare the original and the simulated FGs. We consider
node out-degree and arc weight (i.e., the tag-tag similarity
values) in the original graph against the corresponding
values in the simulated graph; the results are depicted in Figures
6 and 8. We notice that, even with $k = 1$, the points on
the degree plot are aligned on a line whose slope is close
to the diagonal; so we deduce that varying $k$ does
not significantly affect the nodal degree. On the contrary,
arc weights are significantly reduced for low values of $k$; to
reduce the spread from the original values below a reasonable
threshold, $k$ must be set to values that would make an efficient
implementation on a DHT system infeasible.

Nevertheless, our aim is not to minimize
the residuals between theoretical and actual arc weights, but
to preserve certain proportions. First, we
want the ordering of arc weights to be maintained, because the
ranking of the $sim(t_1, t_2)$ weights directly influences the tag
ranking that is displayed during the search process. Second, we
want the proportion between the weights of every pair of
arcs to be preserved; if weight ordering is preserved for a pair of arcs
but the ratio between their values changes significantly in the
simulated network, then there is the risk of a flattening effect
on the tag similarity values, thus reducing the information
provided to the user in the search step.

To give a quantitative measure of these desirable features,
we compared, for each tag $t$ in the dataset, the set of its
outgoing arcs $(t, t_j), t_j \in N_{FG}(t)$ with the same set taken
from the approximated graph. The metrics we used for the
arc weight comparison are Kendall's tau rank correlation
coefficient ($K_\tau$) and the cosine similarity ($\theta$). $K_\tau$ evaluates the
similarity between two rankings of the same set of objects on the
basis of the number of inversions that have to be made to
turn one ranking into the other; it ranges from $-1$ (when the two
rankings are opposite) to $1$ (for equal rankings). $\theta$, which
has the same range as $K_\tau$, takes as input two vectors of the
same length and is equal to 1 if these vectors are perfectly
scaled (e.g., $\theta([1, 2, 3], [100, 200, 300]) = 1$).

Furthermore, to estimate how much information
is lost with the approximation, we calculated the recall
value, that is, the ratio between the number of arcs in the
approximated graph and in the theoretical graph. Finally, we
calculated the portion of arcs, among those that
are not represented in the approximated graph, whose weight is
1 in the theoretical model (we refer to this measure as $sim_1$).
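The comparison metrics can be sketched in plain Python. The text does not say which variant of Kendall's tau was used; the sketch assumes the simple tau-a form, in which tied pairs count as neither concordant nor discordant. All function names are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equally long lists of arc weights."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs

def cosine(x, y):
    """Cosine similarity between two weight vectors of the same length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def recall(approx_arcs, exact_arcs):
    """Fraction of the theoretical arcs that survive in the approximation."""
    return len(approx_arcs) / len(exact_arcs)
```

For a given tag, `x` and `y` would hold the weights of the same outgoing arcs in the theoretical and in the approximated graph, restricted to the arcs present in both.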

The mean values and the standard deviations of the cited
measures, for some low values of $k$, are reported in Table III.
The main results obtained are the following:

A. $K_\tau$ and $\theta$ values, measured on the set of tags which are
common to the two models, are very high, independently
of the value of $k$. This means that retrieved tags in the
approximated model are well ordered and proportioned
compared to the theoretical model.

B. The value of Recall reveals that, for very small values
of $k$, up to 40% of arcs are not represented in the
approximated model. Recall grows sub-linearly with $k$.

C. The extremely high values of $sim_1$ reveal that the
weight of almost all these missing arcs is 1, which is
the minimum value in the similarity network. Further
analyses showed that, for every $k$, 99% of the
missing arcs have a weight $< 3$. So, the missing arcs
are positioned in the very tail of the weight ranking.

Fig. 7. Effect of approximation on tag navigation path length

In a nutshell, even if correct proportions are kept, the
number of arcs in the approximated FG can be considerably
smaller than in the original graph. Nevertheless, the arcs
that are not mapped represent very weak similarities. In fact, 
the great majority of these arcs are simply noise caused by 
the insertion of meaningless or singleton tags, which cover a 
high percentage of the overall tags but that are useless during 
the search phase. Therefore, the approximation adopted does 
not affect the quality of the FG, but rather reduces the noise 
on the mapped graphs and eases the load on the p2p layer. 

C. Faceted search's convergence

Search convergence is important because the quicker the
navigation converges, the lower the number of overlay
lookups needed to locate a resource.

The speed of convergence depends on the first tag
selected. If it resides in the periphery of the FG, the search
procedure will converge almost immediately, because the
number of tags and the number of resources connected with
it will very probably be quite small. Drawing a parallel with a
taxonomic search structure, it is as if the user had started the
search from a node very close to a leaf of the tree
structure, and so had few levels left to explore.

On the contrary, the dual (and probably more frequent)
behavior starts the search from more popular tags, those that
reside in the core. In order to show that convergence is
quick also in this case, we report further simulation results.
We took the 100 most popular tags and, starting from these,
we simulated tag search procedures in order to estimate the
average length of a search.

Three types of search were performed; independently of 
the search strategy, we suppose that the size of the tag set 
shown to the user at each step, $T_i$, is upper bounded to the 
top 100 tags retrieved from the DHT; larger sets of tags would 
be unsuitable for an effective user visualization. In the first 
search type (first tag strategy) the tag selected at each step 
is the most similar to the current tag. Formally, given a 
search path $t_0, \ldots, t_i$, the next tag selected is a label 
$t_{i+1}$ such that 
$sim(t_i, t_{i+1}) \geq sim(t_i, \tau), \forall \tau \in T_i$. 
The second type (last tag strategy) is the dual of the previous 
one: the selected label is always the tag which is the least 
related to the current one among the 100 tags displayed (i.e., 
a tag such that 
$sim(t_i, t_{i+1}) \leq sim(t_i, \tau), \forall \tau \in T_i$). 
In the third search type (random tag strategy), the next tag is 
selected uniformly at random within $T_i$. 
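
The three selection rules can be sketched in Python as follows. 
This is a minimal illustration and not the simulator used for the 
experiments: the function name, the `sim` callable and the toy 
weights are assumptions introduced here, and ties are broken by 
the order of the candidate list. 

```python
import random


def select_next_tag(current, shown, sim, strategy, rng=random):
    """Pick the next tag among `shown`, the tag set displayed at this step.

    `sim(a, b)` returns the similarity between two tags; `strategy`
    is one of "first", "last" or "random".
    """
    if not shown:
        raise ValueError("no tags to select from")
    if strategy == "first":
        # Most similar to the current tag.
        return max(shown, key=lambda t: sim(current, t))
    if strategy == "last":
        # Least similar to the current tag.
        return min(shown, key=lambda t: sim(current, t))
    if strategy == "random":
        # Uniformly at random within the displayed set.
        return rng.choice(list(shown))
    raise ValueError(f"unknown strategy: {strategy}")


# Invented similarity weights, for illustration only.
weights = {("rock", "metal"): 9, ("rock", "indie"): 5, ("rock", "pop"): 2}


def toy_sim(a, b):
    return weights.get((a, b), 0)


shown = ["indie", "pop", "metal"]
print(select_next_tag("rock", shown, toy_sim, "first"))
print(select_next_tag("rock", shown, toy_sim, "last"))
```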

For each tag among the 100 most popular we simulated the 
"first" and "last" search and 100 random searches, on both the 
original and the approximated Folksonomy Graph (for k = 1), 
using the faceted search algorithm described in Section III-C. 
The search procedure is stopped when $|T_i|$ reduces to 1 or 
when $|R_i| < 10$. We chose 10 as the lower threshold for the 
number of displayed resources because a set of 10 objects is 
small enough to be displayed to the user without the need for 
further filtering. 
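
A rough Python sketch of such a simulated search is given below. 
It is not the algorithm of Section III-C, which is not reproduced 
here: as a simplifying assumption, similarity is taken to be the 
number of shared resources, the displayed tags are the not yet 
selected tags that still label at least one remaining resource, 
and the path length is the number of selections made after the 
starting tag. All names and the toy dataset are hypothetical. 

```python
def simulate_search(start, tag_resources, choose,
                    max_tags=100, min_resources=10):
    """Return the number of tag selections made after the starting tag.

    `tag_resources` maps each tag to the set of resources it labels;
    `choose(current, shown, sim)` implements the selection strategy.
    """
    def sim(a, b):
        # Assumed similarity: number of resources the two tags share.
        return len(tag_resources[a] & tag_resources[b])

    selected = {start}
    resources = set(tag_resources[start])
    current, steps = start, 0
    while True:
        # Assumed narrowing rule: unselected tags that still label
        # at least one of the remaining resources.
        candidates = [t for t, rs in tag_resources.items()
                      if t not in selected and resources & rs]
        candidates.sort(key=lambda t: sim(current, t), reverse=True)
        shown = candidates[:max_tags]  # at most the top 100 tags
        # Stop when at most one tag is left or fewer than 10 resources remain.
        if len(shown) <= 1 or len(resources) < min_resources:
            return steps
        current = choose(current, shown, sim)
        selected.add(current)
        resources &= tag_resources[current]
        steps += 1


def most_similar(current, shown, sim):
    """'First tag' strategy: always follow the strongest similarity."""
    return max(shown, key=lambda t: sim(current, t))


# Invented toy dataset: each tag labels a nested range of resource ids.
toy = {
    "rock": set(range(100)),
    "indie": set(range(50)),
    "shoegaze": set(range(25)),
    "dreampop": set(range(12)),
    "lofi": set(range(6)),
}
print(simulate_search("rock", toy, most_similar))
```

Since every step adds a tag to the selected set, the loop always 
terminates, which reflects the acyclic nature of the navigation. 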

Statistics on search path lengths are shown in Table IV. 
From the experiments, it emerges that the path length is 
characterized by a high variance, for every search strategy, 
due to the high variability in the nodal degree of the FG. 

With regard to searches performed in the original model, 
we observe that in the "last" and "random" strategies, the mean 
(and median) values are very small compared to the size of 
the dataset; in particular, note that these values are $< \ln(|T|)$. 
The "first tag" strategy produces longer paths; however, a 
deeper inspection of the results revealed that they originate 
from tag selection sequences which are very unlikely to be 
produced by a real user. 

Roughly, the great majority of such sequences are those 
in which almost all the tags selected are among the most popular; 
since such tags are connected with huge sets of tags and 
resources, the size of the resource and tag sets decreases slowly 
at each search step. This is an expected behavior of the system, 
because if the user does not specialize the search terms, 
the navigation clearly remains at a very coarse-grained 
level. Other slow-converging sequences are those in 
which many synonyms appear (e.g., "electronica", "electronic", 
"electro"). Here, since semantically equivalent tags mark more 
or less the same set of resources, navigating from 
one to another does not add any filtering information to the 
search procedure. 

Such categories of search paths occur because the meaning of 
the tags is not taken into account in the simulations. When the 
tag navigation is performed by a human user, who follows a 
semantic thread in tag selection, the path leading to the 
objective would probably be shorter, and more similar to the 
"random tag" selection case. 

Comparing the simulation results obtained in the original 
Folksonomy Graph with those obtained for the approximated 
graph, the advantage in query convergence brought by the 
approximation is clearly shown. Figure 7, which plots the 
cumulative distribution function of search path lengths for both 
models in the three strategies, together with the statistics of 
Table IV, shows that the approximated approach shortens the 
navigation, thus quickening convergence. This effect, particularly 
evident in the "first tag" strategy, is determined by the deletion 
of lightweight arcs from the graph. By wiping out the noisy 
connections, the semantic distance between tags is increased, 
thus leading to a faster vocabulary specialization during the 
tag selection process. 
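
For reference, a cumulative distribution of path lengths of the 
kind plotted in Figure 7 can be computed from a list of simulated 
path lengths as sketched below. The function name and the sample 
lengths are invented for illustration and are not taken from the 
experiments. 

```python
from collections import Counter


def empirical_cdf(path_lengths):
    """Return (length, fraction of searches with path <= length) pairs."""
    if not path_lengths:
        raise ValueError("no path lengths given")
    counts = Counter(path_lengths)
    total = len(path_lengths)
    cdf, running = [], 0
    for length in sorted(counts):
        running += counts[length]
        cdf.append((length, running / total))
    return cdf


# Invented sample of search path lengths.
sample = [1, 2, 2, 4]
for length, fraction in empirical_cdf(sample):
    print(length, fraction)
```

A curve that reaches high fractions at small lengths corresponds 
to a model in which most searches converge in few steps. 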

As a final consideration, recall that the simulated search 
ends when the set of resources shrinks to an arbitrary threshold 
set to 10; if this value is raised, even slightly, path lengths 
could be considerably reduced. 

VI. Conclusions and future work 

We presented an approximated approach for the maintenance 
of a folksonomy graph, in order to make a fully 
distributed implementation of a tagging system feasible. In 
practice, we introduced a connection parameter k which acts as 
an upper bound to the number of lookups executed on a DHT-based 
system. 

Simulation and analytic studies show that the approximated 
representation of the similarity graph does not upset the 
features of our theoretic Folksonomy Graph model, even 
for k = 1. Besides, approximation can (1) largely mitigate 
overfitting phenomena, and (2) significantly reduce the number of 
overlay operations for new tag insertion without degrading the 
user search experience. The information which gets lost in the 
approximated mapping is predominantly noise. 

Furthermore, the properties of search navigation acyclicity and 
convergence, typical of taxonomic representations, are guaranteed 
by our framework, even if a taxonomy is not explicitly built 
from the flat tag space. The efficiency of tag navigation 
convergence is shown by a simulation experiment on a large 
dataset from Last.fm. The approximated mapping reduces the 
average number of search steps, because the elimination of 
noisy similarity links between tags leads to a more effective 
filtering when new tags are selected during the navigation 
process. The overall approach leads to a better exploitation of 
the DHT layer. The low number of lookups needed during the 
insertion/search phases allows an efficient implementation of a 
tag-based, general-purpose indexing service over a structured 
p2p network. 

Emulative and evolutionary analysis is planned for the near 
future. Indeed, the way in which our system reacts to particular 
evolutions deserves further investigation. In particular, we 
are planning to study whether our approximated model hampers 
the emergence of new tagging trends: forthcoming tests will 
address the dynamics of different tag-resource patterns, and 
how the continuous activity of the community of users affects 
the adaptability of our p2p model. 